**Loop unrolling and pipelining**

Anh Huy Bui

293257

**Question 1: Go to the “Schedule” view and hover the cursor over different loops in the loop hierarchy. What do the following items in the appearing box mean?**

C-steps: Control steps. The number of scheduling steps needed to execute the operations of one loop iteration.

Iterations: The number of times the loop body executes before the loop exits.

Cycles In: C-steps\*Iterations

Total cycles in: Cycles spent in the current loop itself (inner loops and their operations are not included).

Total cycles under: Cycles spent in the loops nested inside the current loop.

Total cycles: Cycles needed to execute the whole loop (Total cycles in + Total cycles under).

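Throughput Period: The number of cycles between the start of one execution and the start of the next, i.e. how often a new input can be accepted. This differs from latency, which is the number of cycles from input to output.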
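The relations above can be checked with a small calculation. The following is only a sketch: the `Loop` class, the helper names, and the loop sizes are hypothetical and not taken from the tool or from this design. It also assumes that an inner loop is entered once per iteration of its parent, which is why the inner loop's cycles are multiplied by the parent's iteration count.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    """Hypothetical model of one entry in the loop hierarchy."""
    name: str
    c_steps: int          # control steps per iteration
    iterations: int       # times the loop body runs per entry
    children: list = field(default_factory=list)


def cycles_in(loop):
    # Cycles In = C-steps * Iterations (for a single entry into the loop)
    return loop.c_steps * loop.iterations


def total_cycles_in(loop, entries=1):
    # Cycles spent in this loop itself, over all entries (assumption:
    # one entry per iteration of the parent loop)
    return entries * cycles_in(loop)


def total_cycles_under(loop, entries=1):
    # Cycles spent in all loops nested inside this loop
    return sum(total_cycles(child, entries * loop.iterations)
               for child in loop.children)


def total_cycles(loop, entries=1):
    # Total cycles = Total cycles in + Total cycles under
    return total_cycles_in(loop, entries) + total_cycles_under(loop, entries)


# Made-up example: an 8-iteration outer loop around an 8-iteration inner loop
inner = Loop("inner", c_steps=2, iterations=8)
outer = Loop("outer", c_steps=1, iterations=8, children=[inner])

print(total_cycles_in(outer))     # 8
print(total_cycles_under(outer))  # 128
print(total_cycles(outer))        # 136
```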

**Question 2: Draw a table of the architectures with their area scores and throughputs. Clearly indicate which loops you unrolled/pipelined. Which architecture gives the best throughput/area trade-off in your opinion? Explain why**

| Architecture | Throughput period (cycles) | Latency (cycles) | Area score |
| --- | --- | --- | --- |
| Pipeline innermost loop (II = 1) | 2594 | 2591 | 269601.75 |
| Pipeline 1st nested loop (II = 1) | 2114 | 2111 | 226574.68 |
| Pipeline outermost loop (II = 1) | 2054 | 2052 | 134428.8 |
| Pipeline main (II = 1) | 512 | 512 | 74287.02 |
| Fully unroll all loops | 65 | 64 | 186026.18 |
| Fully unroll outermost loop | 2114 | 2112 | 156895.23 |
| Fully unroll 1st nested loop | 2338 | 2336 | 186397.93 |
| Unroll the innermost loop with (U=4) | 2594 | 2591 | 279684.70 |
| Fully unroll the innermost loop | 2594 | 2591 | 289452.00 |
| Unroll the innermost loop with (U=2) and Pipeline (II = 1) | 1570 | 1567 | 276761.19 |
| Fully unroll all loops + Pipeline outermost loop (II = 1) | 18 | 16 | 912606 |
| Pipeline main (II = 1) + Fully unroll innermost loop | 64 | 64 | 148060.93 |
| Pipeline main (II = 1) + Fully unroll all loops | 1 | 1 | 3702291.25 |

Pipeline main (II = 1) + Fully unroll innermost loop gave the best trade-off in my opinion.

Pipelining main alone cuts the throughput period to 512 cycles and also has the smallest area in the table (74287.02). Fully unrolling the innermost loop on top of that improves the throughput period to 64 cycles, eight times better, for roughly twice the area (148060.93), which I consider an acceptable cost.
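One way to cross-check this choice is to filter the table down to the architectures that no other architecture beats in both throughput period and area (a Pareto front). The sketch below uses the throughput and area values from the table; the function names are hypothetical, and "better" is assumed to mean a smaller throughput period and a smaller area score.

```python
# (architecture, throughput period in cycles, area score), copied from the table
architectures = [
    ("Pipeline innermost loop (II = 1)", 2594, 269601.75),
    ("Pipeline 1st nested loop (II = 1)", 2114, 226574.68),
    ("Pipeline outermost loop (II = 1)", 2054, 134428.8),
    ("Pipeline main (II = 1)", 512, 74287.02),
    ("Fully unroll all loops", 65, 186026.18),
    ("Fully unroll outermost loop", 2114, 156895.23),
    ("Fully unroll 1st nested loop", 2338, 186397.93),
    ("Unroll the innermost loop (U=4)", 2594, 279684.70),
    ("Fully unroll the innermost loop", 2594, 289452.00),
    ("Unroll the innermost loop (U=2) + Pipeline (II = 1)", 1570, 276761.19),
    ("Fully unroll all loops + Pipeline outermost loop (II = 1)", 18, 912606.0),
    ("Pipeline main (II = 1) + Fully unroll innermost loop", 64, 148060.93),
    ("Pipeline main (II = 1) + Fully unroll all loops", 1, 3702291.25),
]


def dominates(a, b):
    """True if a is no worse than b in both metrics and strictly better in one."""
    _, t_a, area_a = a
    _, t_b, area_b = b
    return t_a <= t_b and area_a <= area_b and (t_a < t_b or area_a < area_b)


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    # Sort from slowest/smallest to fastest/largest
    return sorted(front, key=lambda p: p[1], reverse=True)


front = pareto_front(architectures)
for name, throughput, area in front:
    print(f"{throughput:>5} cycles  {area:>12.2f}  {name}")
```

With the table's values, four architectures remain (throughput periods 512, 64, 18 and 1), and the one chosen above is among them. Which point on the front is best still depends on how much area the application can afford.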

**Question 3: Find an architecture, which yields a throughput of one. Which loop unrolling and pipelining options do you have to choose? What is the area score?**

To reach a throughput period of one cycle, all loops have to be fully unrolled, and main has to be pipelined (II = 1) to shorten the throughput period. The architecture "Pipeline main (II = 1) + Fully unroll all loops" meets this requirement.

Area: 3702291.25.
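To make the unrolling term concrete, here is a software analogy only. The multiply-accumulate kernel below is a hypothetical example, not the design used in this exercise, and in the HLS tool unrolling is applied as a directive rather than by rewriting the source. Unrolling by U = 2 copies the loop body twice, so the loop runs half as many iterations while computing the same result; in hardware the copies can run in parallel, which is where the extra area comes from.

```python
def mac_rolled(a, b):
    """Multiply-accumulate with one operation pair per iteration."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def mac_unrolled_by_2(a, b):
    """Same computation, unrolled by a factor of U = 2.

    Assumes the length is a multiple of 2, so no remainder loop is needed.
    """
    assert len(a) % 2 == 0
    acc = 0
    for i in range(0, len(a), 2):   # half as many iterations
        acc += a[i] * b[i]          # copy 1 of the loop body
        acc += a[i + 1] * b[i + 1]  # copy 2 of the loop body
    return acc


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(mac_rolled(a, b))         # 70
print(mac_unrolled_by_2(a, b))  # 70
```

Fully unrolling corresponds to U equal to the iteration count, which removes the loop altogether.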

**Question 4: How long did it take you to complete this exercise? How would you evaluate time usage compared to if you had done this exploration using RTL methods (i.e. writing VHDL or Verilog code for different architectures)?**

I needed 4-5 hours to complete this exercise, mostly because of the RTL step.  
However, if I had to write the code for each architecture by hand (manual unrolling and pipelining in VHDL or Verilog), it would take much more time.